Middle School


MinorBench: A hand-built benchmark for content-based risks for children

arXiv.org Artificial Intelligence

Large Language Models (LLMs) are rapidly entering children's lives, through parent-driven adoption, schools, and peer networks, yet current AI ethics and safety research does not adequately address content-related risks specific to minors. In this paper, we highlight these gaps with a real-world case study of an LLM-based chatbot deployed in a middle school setting, revealing how students used and sometimes misused the system. We evaluate six prominent LLMs under different system prompts, demonstrating substantial variability in their child-safety compliance. Our results inform practical steps for more robust, child-focused safety mechanisms and underscore the urgency of tailoring AI systems to safeguard young users. LLMs have seen rapid adoption in educational settings, with both teachers and students recognizing their potential for personalized feedback and instant instructional support. Recent surveys indicate that over half of K-12 teachers in some regions now use LLMs for lesson planning, grading assistance, or creative class activities, while approximately one-third of students, some as young as 12, have experimented with such models for schoolwork (Common Sense Media, 2024). However, the emergence of LLMs in schools raises concerns about children's vulnerability: children are still developing critical thinking skills, often place higher trust in authoritative-sounding answers, and may not fully understand an AI's limitations.
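The abstract describes evaluating six LLMs under different system prompts for child-safety compliance. A minimal sketch of such a harness follows; the model names, the `chat` callable, and the keyword-based refusal check are all illustrative assumptions, not MinorBench's actual grading procedure.

```python
# Hedged sketch: refusal rate per (model, system prompt) pair over a set of
# child-risk probe queries. `chat` stands in for any chat-completion API and
# the keyword refusal check is a crude heuristic, not the paper's grading.
MODELS = ["model-a", "model-b"]          # hypothetical model identifiers
SYSTEM_PROMPTS = {
    "none": "",
    "child_safe": ("You are a tutor for middle school students. "
                   "Refuse unsafe or age-inappropriate requests."),
}

def is_refusal(reply: str) -> bool:
    return any(m in reply.lower() for m in ("i can't", "i cannot", "i won't"))

def run_eval(chat, probes):
    """chat(model, system, user) -> reply text; returns refusal rates."""
    results = {}
    for model in MODELS:
        for name, system in SYSTEM_PROMPTS.items():
            replies = [chat(model, system, p) for p in probes]
            results[(model, name)] = sum(map(is_refusal, replies)) / len(replies)
    return results

# Usage with a stub that always refuses:
rates = run_eval(lambda m, s, u: "I can't help with that.", ["probe 1", "probe 2"])
```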


Privacy-Preserved Automated Scoring using Federated Learning for Educational Research

arXiv.org Artificial Intelligence

Data privacy remains a critical concern in educational research, necessitating Institutional Review Board (IRB) certification and stringent data handling protocols to ensure compliance with ethical standards. Traditional approaches rely on anonymization and controlled data-sharing mechanisms to facilitate research while mitigating privacy risks. However, these methods still involve direct access to raw student data, which poses potential vulnerabilities and is time-consuming. This study proposes a federated learning (FL) framework for automatic scoring in educational assessments, eliminating the need to share raw data. Our approach leverages client-side model training, where student responses are processed locally on edge devices, and only optimized model parameters are shared with a central aggregation server. To effectively aggregate heterogeneous model updates, we introduce an adaptive weighted averaging strategy, which dynamically adjusts weight contributions based on client-specific learning characteristics. This method ensures robust model convergence while preserving privacy. We evaluate our framework using assessment data from nine middle schools, comparing the accuracy of federated learning-based scoring models with traditionally trained centralized models. A statistical significance test (paired t-test, $t(8) = 2.29, p = 0.051$) indicates that the accuracy difference between the two approaches is not statistically significant, demonstrating that federated learning achieves comparable performance while safeguarding student data. Furthermore, our method significantly reduces data collection, processing, and deployment overhead, accelerating the adoption of AI-driven educational assessments in a privacy-compliant manner.
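The abstract's key mechanism is server-side aggregation with adaptive, client-specific weights. The paper's exact weighting rule is not given here, so the sketch below assumes weights proportional to each client's sample count scaled by its inverse validation loss, as one plausible reading of "client-specific learning characteristics".

```python
# Minimal sketch of federated averaging with adaptive client weights,
# assuming weights proportional to local sample count scaled by inverse
# validation loss (the paper's exact rule may differ).
import numpy as np

def adaptive_weighted_average(client_params, sample_counts, val_losses, eps=1e-8):
    """Aggregate per-client parameter vectors into a global update.

    client_params: list of 1-D arrays (flattened model parameters)
    sample_counts: number of local training examples per client
    val_losses:    local validation loss per client (lower -> more weight)
    """
    raw = np.asarray(sample_counts, dtype=float) / (np.asarray(val_losses) + eps)
    weights = raw / raw.sum()          # normalize so the weights sum to 1
    return weights @ np.stack(client_params)

# Three hypothetical school clients, each sending a 2-parameter update:
params = [np.array([0.2, 1.0]), np.array([0.4, 0.8]), np.array([0.1, 1.2])]
global_update = adaptive_weighted_average(params, [120, 300, 80], [0.9, 0.5, 1.4])
```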


Beyond One-Size-Fits-All Summarization: Customizing Summaries for Diverse Users

arXiv.org Artificial Intelligence

In recent years, automatic text summarization has witnessed significant advancement, particularly with the development of transformer-based models. However, the challenge of controlling the readability level of generated summaries remains an under-explored area, especially for languages with complex linguistic features like Turkish. This gap impedes effective communication and limits the accessibility of information. Controlling the readability of textual data is important for creating summaries for audiences with varying literacy and education levels, such as students ranging from primary school to graduate level, as well as individuals with diverse educational backgrounds. Summaries that align with the needs of specific reader groups can improve comprehension and engagement, ensuring that the intended message is effectively communicated. Furthermore, readability adjustment is essential to expand the usability of summarization models in educational and professional domains. Current summarization models often lack mechanisms to adjust the complexity of their outputs, resulting in summaries that may be too simplistic or overly complex for certain reader groups. Developing adaptive models that can tailor content to specific readability levels is therefore crucial. To address this problem, we create our own custom dataset and train a model with our custom architecture. Our method ensures that readability levels are effectively controlled while maintaining accuracy and coherence. We rigorously compare our model to a supervised fine-tuned baseline, demonstrating its superiority in generating readability-aware summaries.
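The abstract does not detail the custom architecture, but one common way to expose a readability knob in a sequence-to-sequence summarizer is a control token prepended to the input. The sketch below is purely illustrative of that pattern, not the paper's design; the level tokens are invented.

```python
# Illustrative sketch of readability-conditioned summarization via control
# tokens (an assumption, not the paper's architecture). At training time each
# (document, summary) pair is tagged with the summary's measured readability
# level; at inference time the caller picks the target token.
LEVELS = ["<primary>", "<middle>", "<high>", "<graduate>"]

def build_input(document: str, level: str) -> str:
    """Prepend a readability control token so the model conditions on it."""
    if level not in LEVELS:
        raise ValueError(f"unknown level {level!r}")
    return f"{level} {document}"

# e.g., ask for a middle-school-level summary of a Turkish news article:
model_input = build_input("Uzun bir haber metni ...", "<middle>")
```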


Carelessness Detection using Performance Factor Analysis: A New Operationalization with Unexpectedly Different Relationship to Learning

arXiv.org Artificial Intelligence

Detection of carelessness in digital learning platforms has relied on the contextual slip model, which leverages conditional probability and Bayesian Knowledge Tracing (BKT) to identify careless errors, where students make mistakes despite having the knowledge. However, this model cannot effectively assess carelessness in questions tagged with multiple skills due to its use of conditional probability, which narrows the scope within which the model can be applied. Thus, we propose a novel model, the Beyond-Knowledge Feature Carelessness (BKFC) model. The model detects careless errors using performance factor analysis (PFA) and behavioral features distilled from log data, controlling for knowledge when detecting carelessness. We applied the BKFC model to detect carelessness in data from middle school students playing a learning game on decimal numbers and operations. We conducted analyses comparing the careless errors detected by the contextual slip model to those detected by the BKFC model. Unexpectedly, the careless errors identified by these two approaches did not align. We found that students' post-test performance was, consistent with past results, positively associated with the carelessness detected using the contextual slip model, but negatively associated with the carelessness detected using the BKFC model. These results highlight the complexity of carelessness and underline a broader challenge in operationalizing carelessness and careless errors. Academic discussions of carelessness in classrooms date back to the 1950s [1]. Often viewed as the result of ineffective self-regulation, carelessness is thought to occur when students commit hurried or impulsive behaviors that result in mistakes on problems that could have been answered correctly. By distinguishing mistakes made due to carelessness from those caused by other factors, such as lack of knowledge, adaptive instruction can be provided to engage or reengage students in the effective use of self-regulation during problem-solving. In the last several decades, two streams of work have run in parallel to investigate carelessness and detect careless behaviors.
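PFA makes multi-skill items tractable because each skill contributes an additive term to the logit of a correct response: $\mathrm{logit}\,P(\text{correct}) = \sum_{k} (\beta_k + \gamma_k s_k + \rho_k f_k)$, with $s_k$ and $f_k$ the student's prior successes and failures on skill $k$. The sketch below illustrates that standard formula; the parameter values are invented, and flagging a wrong answer as careless when predicted $P(\text{correct})$ is high is a simplification of the BKFC pipeline, which also uses behavioral log features.

```python
# Sketch of the standard PFA prediction used to control for knowledge;
# parameter values below are invented for illustration.
import math

def pfa_p_correct(skills, successes, failures, beta, gamma, rho):
    """P(correct) under Performance Factor Analysis.

    Each required skill k contributes beta[k] + gamma[k]*successes[k]
    + rho[k]*failures[k] to the logit, so multi-skill items pose no problem.
    """
    logit = sum(beta[k] + gamma[k] * successes[k] + rho[k] * failures[k]
                for k in skills)
    return 1.0 / (1.0 + math.exp(-logit))

# Hypothetical item tagged with two decimal-arithmetic skills:
p = pfa_p_correct(
    skills=["place_value", "decimal_add"],
    successes={"place_value": 4, "decimal_add": 1},
    failures={"place_value": 1, "decimal_add": 2},
    beta={"place_value": -0.3, "decimal_add": -0.5},
    gamma={"place_value": 0.25, "decimal_add": 0.30},
    rho={"place_value": -0.10, "decimal_add": -0.15},
)
# A wrong answer despite high p is a candidate careless error (the BKFC
# model additionally conditions on behavioral features from log data).
```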


Implementation of a Generative AI Assistant in K-12 Education: The CGScholar AI Helper Initiative

arXiv.org Artificial Intelligence

This paper focuses on the piloting of the CGScholar AI Helper, a Generative AI (GenAI) assistant tool that aims to provide feedback on writing in high school contexts. The aim was to use GenAI to provide formative and summative feedback on students' texts in English Language Arts (ELA) and History. The trials discussed in this paper relate to Grade 11, a crucial learning phase when students are working towards college readiness. These trials took place in two very different schools in the Midwest of the United States, one in a low socio-economic area with low performance outcomes and the other in a high socio-economic area with high performance outcomes. The assistant tool used two main mechanisms: "prompt engineering" based on participating teachers' assessment rubrics, and "fine-tuning" a Large Language Model (LLM) on a customized corpus of teaching materials using Retrieval Augmented Generation (RAG). This paper focuses on the CGScholar AI Helper's potential to enhance students' writing abilities and support teachers in ELA and other subject areas requiring written assignments.
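The combination described, rubric-driven prompts plus retrieval over a corpus of teaching materials, follows the standard RAG pattern. Below is a minimal sketch of that pattern under stated assumptions: `embed` is a deterministic stand-in for a real sentence-embedding model, and the prompt layout is invented rather than the CGScholar implementation.

```python
# Minimal sketch of the RAG pattern the abstract describes.
import numpy as np

def embed(text: str) -> np.ndarray:
    """Placeholder embedding: replace with a real sentence-embedding model."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    return rng.standard_normal(384)

def top_k(query: str, corpus: list[str], k: int = 3) -> list[str]:
    """Rank corpus chunks by cosine similarity to the query."""
    q = embed(query)
    def cos(v: np.ndarray) -> float:
        return float(q @ v) / (np.linalg.norm(q) * np.linalg.norm(v))
    return sorted(corpus, key=lambda c: cos(embed(c)), reverse=True)[:k]

def feedback_prompt(essay: str, rubric: str, corpus: list[str]) -> str:
    """Retrieve rubric-relevant teaching material and build the LLM prompt."""
    context = "\n".join(top_k(essay, corpus))
    return (f"Rubric:\n{rubric}\n\nRelevant teaching material:\n{context}\n\n"
            f"Student text:\n{essay}\n\nGive formative feedback per the rubric.")
```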


Evaluating GenAI for Simplifying Texts for Education: Improving Accuracy and Consistency for Enhanced Readability

arXiv.org Artificial Intelligence

Generative artificial intelligence (GenAI) holds great promise as a tool to support personalized learning. Teachers need tools to efficiently and effectively adjust the readability of educational texts so that they are matched to individual students' reading levels while retaining key details. Large Language Models (LLMs) show potential to fill this need, but previous research notes multiple shortcomings in current approaches. In this study, we introduced a generalized approach and metrics for systematically evaluating the accuracy and consistency with which LLMs, prompting techniques, and a novel multi-agent architecture simplify sixty informational reading passages, reducing each from the twelfth grade level down to the eighth, sixth, and fourth grade levels. We calculated the degree to which each LLM and prompting technique accurately achieved the targeted grade level for each passage, the percentage change in word count, and the consistency in maintaining keywords and key phrases (semantic similarity). One-sample t-tests and multiple regression models revealed significant differences in the best-performing LLM and prompting technique for each of the four metrics. Both LLMs and prompting techniques demonstrated variable utility in grade-level accuracy and in consistency of keywords and key phrases when attempting to level content down to the fourth grade reading level. These results demonstrate the promise of LLMs for efficient and precise automated text simplification, the shortcomings of current models and prompting methods in attaining an ideal balance across various evaluation criteria, and a generalizable method to evaluate future systems.
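One of the metrics described, whether a simplified passage hits its target grade level, can be approximated with a traditional readability formula. The sketch below uses the standard Flesch-Kincaid grade-level formula with a crude vowel-group syllable counter; a real evaluation would use a proper syllabifier or an established readability library, and the paper's own grade-level measure may differ.

```python
# Flesch-Kincaid grade-level check for a simplified passage. The vowel-group
# syllable counter is deliberately crude; use a proper syllabifier or a
# readability library for real evaluations.
import re

def count_syllables(word: str) -> int:
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def fk_grade(text: str) -> float:
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z']+", text) or ["a"]   # guard empty input
    syllables = sum(count_syllables(w) for w in words)
    return 0.39 * len(words) / sentences + 11.8 * syllables / len(words) - 15.59

def grade_error(simplified: str, target_grade: int) -> float:
    """Signed miss: positive means the text still reads above the target."""
    return fk_grade(simplified) - target_grade
```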


Democratizing Signal Processing and Machine Learning: Math Learning Equity for Elementary and Middle School Students

arXiv.org Artificial Intelligence

Signal Processing (SP) and Machine Learning (ML) rely on good math and coding knowledge, in particular linear algebra, probability, and complex numbers. A good grasp of these relies on scalar algebra learned in middle school. The ability to understand and use scalar algebra well, in turn, relies on a good foundation in basic arithmetic. Because of various systemic barriers, many students are not able to build a strong foundation in arithmetic in elementary school. This leads them to struggle with algebra and everything after that. Since math learning is cumulative, the gap between those without a strong early foundation and everyone else keeps increasing over the school years and becomes difficult to close in college. In this article we discuss how SP faculty and graduate students can play an important role in starting, and participating in, university-run (or other) out-of-school math support programs to supplement students' learning. Two example programs run by the authors (CyMath at ISU and Ab7G at Purdue) are briefly described. The second goal of this article is to use our perspective as SP and engineering educators who have seen the long-term impact of elementary school math teaching policies to offer some simple, almost zero-cost suggestions that elementary schools could adopt to improve math learning: (i) more math practice in school, (ii) assigning small amounts of homework (individual work is critical in math), and (iii) raising parent awareness (math resources, the need for an early math foundation, and clear in-school test information with shared feedback from the tests). In summary, good early math support, in school and through out-of-school programs, can help make SP and ML more accessible.


AI-powered deepfake nude websites are targeted by San Francisco city attorney's lawsuit

Los Angeles Times

San Francisco City Attorney David Chiu announced Thursday that his office is suing the operators of 16 AI-powered "undressing" websites that help users create and distribute deepfake nude photos of women and girls. The lawsuit, which city officials said was the first of its kind, accuses the websites' operators of violating state and federal laws that ban deepfake pornography, revenge pornography, and child pornography, as well as California's unfair competition law. The names of the sites were redacted in the copy of the suit made public Thursday. Chiu's office has yet to identify the owners of many of the websites, but officials say they hope to find their names and hold them accountable. Chiu said the lawsuit has two goals: shutting down these websites and sounding the alarm about this form of "sexual abuse."


MM-MATH: Advancing Multimodal Math Evaluation with Process Evaluation and Fine-grained Classification

arXiv.org Artificial Intelligence

To advance the evaluation of multimodal math reasoning in large multimodal models (LMMs), this paper introduces a novel benchmark, MM-MATH. MM-MATH consists of 5,929 open-ended middle school math problems with visual contexts, with fine-grained classification across difficulty, grade level, and knowledge points. Unlike existing benchmarks relying on binary answer comparison, MM-MATH incorporates both outcome and process evaluations. Process evaluation employs LMM-as-a-judge to automatically analyze solution steps, identifying and categorizing errors into specific error types. Extensive evaluation of ten models on MM-MATH reveals significant challenges for existing LMMs, highlighting their limited utilization of visual information and struggles with higher-difficulty problems. The best-performing model achieves only 31% accuracy on MM-MATH, compared to 82% for humans. This highlights the challenging nature of our benchmark for existing models and the significant gap between the multimodal reasoning capabilities of current models and humans. Our process evaluation reveals that diagram misinterpretation is the most common error, accounting for more than half of the total error cases, underscoring the need for improved image comprehension in multimodal reasoning.
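The process-evaluation step hands each model solution to an LMM acting as judge and asks it to locate and categorize the first error. A sketch of such a judge prompt is below; the error taxonomy is illustrative (the abstract names diagram misinterpretation as the dominant category), and the prompt wording is not MM-MATH's.

```python
# Illustrative LMM-as-a-judge prompt for step-level error classification.
ERROR_TYPES = ["diagram misinterpretation", "reasoning error",
               "calculation error", "knowledge error"]  # illustrative taxonomy

def judge_prompt(problem: str, reference: str, solution: str) -> str:
    return (
        "You are grading a step-by-step solution to a middle school math "
        "problem that includes a diagram.\n"
        f"Problem: {problem}\n"
        f"Reference solution: {reference}\n"
        f"Model solution: {solution}\n"
        "Identify the first incorrect step, explain why it is wrong, and "
        "label it with exactly one of: " + ", ".join(ERROR_TYPES) + "."
    )
```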


Free-text Rationale Generation under Readability Level Control

arXiv.org Artificial Intelligence

Free-text rationales justify model decisions in natural language, which makes them a popular and accessible form of explanation across many tasks. However, their effectiveness can be hindered by misinterpretation and hallucination. As a perturbation test, we investigate how large language models (LLMs) perform the task of natural language explanation (NLE) under readability level control, i.e., when prompted to produce a rationale targeting a specific expertise level, such as sixth grade or college. We find that explanations adapt to such instructions, but the requested readability is often misaligned with the measured text complexity according to traditional readability metrics. Furthermore, our quality assessment shows that LLMs' ratings of rationales across text complexity exhibit a pattern of preference similar to that observed in natural language generation (NLG). Finally, our human evaluation suggests a generally satisfactory impression of rationales at all readability levels, with high-school-level readability being most commonly perceived and favored.
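The misalignment finding rests on comparing the requested level against a traditional readability metric. A quick sketch of that check using Flesch Reading Ease follows; the score bands in the comment are the conventional interpretation, and the mapping from bands to the paper's expertise levels is an assumption.

```python
# Requested-vs-measured readability check using Flesch Reading Ease.
import re

def flesch_reading_ease(text: str) -> float:
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z']+", text) or ["a"]   # guard empty input
    syllables = sum(max(1, len(re.findall(r"[aeiouy]+", w.lower())))
                    for w in words)
    return 206.835 - 1.015 * len(words) / sentences - 84.6 * syllables / len(words)

# Conventional score bands (approximate): 90-100 very easy (~5th grade),
# 60-70 plain English (~8th-9th grade), 0-30 very difficult (college grad).
rationale = "The model chose label A because the text mentions a school."
print("requested: sixth grade | measured FRE:",
      round(flesch_reading_ease(rationale), 1))
```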